Inference with Reference: Lossless Acceleration of Large Language Models
We propose LLMA, an LLM accelerator to losslessly speed up Large Language
Model (LLM) inference with references. LLMA is motivated by the observation that abundant identical text spans exist between an LLM's decoding output and a reference that is available in many real-world scenarios (e.g., retrieved documents). LLMA first selects a text span from the reference, copies its tokens to the decoder, and then efficiently checks the tokens' appropriateness as the decoding result, in parallel, within a single decoding step.
The improved computational parallelism allows LLMA to achieve over a 2x speed-up while producing generation results identical to greedy decoding, in many practical generation scenarios where significant overlap exists between the in-context reference and the outputs (e.g., search engines and multi-turn conversations).
Comment: 9 pages
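The copy-then-verify step admits a compact illustration. The following Python sketch is one plausible reading of the abstract, not the paper's implementation: the "model" object with greedy_next (one-token greedy step) and greedy_parallel (one forward pass returning the greedy prediction at every drafted position) is a hypothetical interface. Acceptance stops at the first position where a drafted token disagrees with the model's own greedy choice, which is what keeps the output identical to plain greedy decoding.

def find_reference_continuation(output_tokens, reference, n=4):
    # Find a span in the reference whose first n tokens match the last n
    # generated tokens, and return the tokens after it as a draft to copy.
    if len(output_tokens) < n:
        return None
    suffix = output_tokens[-n:]
    for i in range(len(reference) - n + 1):
        if reference[i:i + n] == suffix:
            return reference[i + n:]
    return None

def llma_decode_step(model, output_tokens, reference, copy_len=8):
    candidate = find_reference_continuation(output_tokens, reference)
    if not candidate:
        # No matching span: fall back to ordinary one-token greedy decoding.
        return output_tokens + [model.greedy_next(output_tokens)]
    draft = candidate[:copy_len]
    # One forward pass scores all drafted positions in parallel; position i
    # yields the greedy token the model would emit after output + draft[:i].
    predictions = model.greedy_parallel(output_tokens, draft)
    accepted = []
    for drafted, predicted in zip(draft, predictions):
        accepted.append(predicted)   # the model's own token is always valid
        if predicted != drafted:     # first disagreement ends the copy
            break
    return output_tokens + accepted

In the best case all copy_len drafted tokens are accepted in a single step; in the worst case the step still yields one valid greedy token, so decoding never regresses below the baseline.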
BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver
Open-domain question answering is a crucial task that often requires
accessing external information. Existing methods typically adopt a single-turn
retrieve-then-read approach, where relevant documents are first retrieved, and
questions are then answered based on the retrieved information. However, there
are cases where answering a question requires implicit knowledge that is not
directly retrievable from the question itself. In this work, we propose a novel
question-answering pipeline called BeamSearchQA. Our approach leverages large
language models to iteratively generate new questions about the original one, enabling a multi-step reasoning process. By iteratively refining and expanding the scope of the question, our method aims to capture and utilize hidden knowledge that may not be directly obtainable through retrieval. We evaluate our approach on the widely used open-domain NQ and WebQ datasets. The experimental results demonstrate that BeamSearchQA significantly outperforms other zero-shot baselines, indicating its effectiveness in tackling the challenges of open-domain question answering.
Comment: Work in progress
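Since the abstract only outlines the pipeline, the sketch below is a minimal reading of the iterative question-expansion loop under stated assumptions, not the paper's implementation. The callables retrieve, ask_llm, score, and answer_llm are caller-supplied stand-ins for the retriever and the LLM prompts; their names and signatures are assumptions.

def beam_search_qa(question, retrieve, ask_llm, score, answer_llm,
                   beam_width=3, max_rounds=2):
    # retrieve(q) -> list of documents
    # ask_llm(q, docs, n) -> list of up to n follow-up questions
    # score((q, evidence)) -> float ranking a candidate expansion
    # answer_llm(q, evidence) -> answer string
    beam = [(question, [])]  # (current question, evidence gathered so far)
    for _ in range(max_rounds):
        candidates = []
        for q, evidence in beam:
            docs = retrieve(q)
            # Follow-up questions surface implicit knowledge that the
            # original question alone cannot retrieve.
            for follow_up in ask_llm(q, docs, n=beam_width):
                candidates.append((follow_up, evidence + docs))
        if not candidates:
            break
        # Keep the top-scoring expansions, beam-search style.
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    _, best_evidence = beam[0]
    # Answer the original question from the evidence along the best beam.
    return answer_llm(question, best_evidence)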
LEAD: Liberal Feature-based Distillation for Dense Retrieval
Knowledge distillation is often used to transfer knowledge from a strong
teacher model to a relatively weak student model. Traditional knowledge
distillation methods include response-based methods and feature-based methods.
Response-based methods are the most widely used but suffer from a lower upper bound on model performance, while feature-based methods impose constraints on vocabularies and tokenizers. In this paper, we propose a tokenizer-free method, liberal feature-based distillation (LEAD). LEAD aligns the distributions of the teacher model and the student model; it is effective, extensible, and portable, and places no requirements on vocabulary, tokenizer, or model architecture.
Extensive experiments show the effectiveness of LEAD on several widely used benchmarks, including MS MARCO Passage, TREC Passage 19, TREC Passage 20, MS MARCO Document, TREC Document 19, and TREC Document 20.
Comment: Work in progress
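The abstract does not specify how the teacher and student distributions are aligned, so the PyTorch snippet below is only a minimal sketch of one tokenizer-free possibility: a KL divergence between softmax distributions over pooled hidden features, with a learned linear projection bridging the dimension mismatch. The pooling, the projection, and the choice of KL are assumptions for illustration, not the paper's exact recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

def lead_style_loss(teacher_feats, student_feats, proj):
    # teacher_feats: [batch, d_teacher] pooled teacher representations
    # student_feats: [batch, d_student] pooled student representations
    # proj: nn.Linear(d_student, d_teacher) bridging the dimension gap
    t_dist = F.softmax(teacher_feats.detach(), dim=-1)      # fixed target
    s_log_dist = F.log_softmax(proj(student_feats), dim=-1)
    return F.kl_div(s_log_dist, t_dist, reduction="batchmean")

# Example: align a 768-d student layer to a 1024-d teacher layer.
proj = nn.Linear(768, 1024)
loss = lead_style_loss(torch.randn(4, 1024), torch.randn(4, 768), proj)
loss.backward()

Because the alignment target is a feature distribution rather than vocabulary logits, the teacher and student need not share a tokenizer or vocabulary, which is consistent with the tokenizer-free property the abstract claims.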